Marks: 60
The stock market has consistently proven to be a good place to invest and save for the future. There are many compelling reasons to invest in stocks: it can help fight inflation, build wealth, and even provide some tax benefits. Thanks to the power of compounding, steady returns over a long period can grow far larger than seems possible, and the earlier one starts investing, the larger the corpus one can accumulate for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. A diversified portfolio tends to yield higher returns and carry lower risk by tempering potential losses when the market is down. It is easy to get lost in the sea of financial metrics available for valuing a stock, and repeating that analysis across many stocks to identify the right picks for an individual can be a tedious task. Cluster analysis can identify stocks that exhibit similar characteristics and stocks that are minimally correlated with one another. This helps investors analyze stocks across different market segments and protects the portfolio against risks that could leave it vulnerable to losses.
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.
# this will help in making the Python code more structured automatically (good coding practice)
#%load_ext nb_black
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Set the Seaborn default theme
sns.set_theme()
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to perform PCA
from sklearn.decomposition import PCA
import warnings
# Suppress all warnings
warnings.filterwarnings("ignore")
# Connecting google drive to google colab
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Loading stock data
stockdata = pd.read_csv('/content/drive/My Drive/Colab_Notebooks/unsupervised_learning/project_7/stock_data.csv')
stockdata.head(10)
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
| 5 | ADM | Archer-Daniels-Midland Co | Consumer Staples | Agricultural Products | 36.680000 | -12.017268 | 1.516493 | 10 | 49 | -189000000 | 1849000000 | 2.99 | 6.183946e+08 | 12.267559 | 7.496831 |
| 6 | ADS | Alliance Data Systems | Information Technology | Data Processing & Outsourced Services | 276.570007 | 6.189286 | 1.116976 | 30 | 25 | 90885000 | 596541000 | 8.91 | 6.695185e+07 | 31.040405 | 129.064585 |
| 7 | AEE | Ameren Corp | Utilities | MultiUtilities | 43.230000 | 2.174424 | 1.124186 | 9 | 14 | 287000000 | 636000000 | 2.60 | 2.446154e+08 | 16.626923 | -0.719497 |
| 8 | AEP | American Electric Power | Utilities | Electric Utilities | 58.270000 | 2.371753 | 1.068485 | 11 | 9 | 13900000 | 2052300000 | 3.13 | 4.218978e+08 | 18.456543 | -3.022649 |
| 9 | AFL | AFLAC Inc | Financials | Life & Health Insurance | 59.900002 | 3.027181 | 1.048295 | 14 | 99 | -308000000 | 2533000000 | 5.88 | 4.307823e+08 | 10.187075 | -1.883912 |
stockdata.shape
(340, 15)
There are 340 rows and 15 columns in the stock dataframe provided by Trade&Ahead.
# viewing a random sample of the dataset
stockdata.sample(n=10, random_state=1)
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 102 | DVN | Devon Energy Corp. | Energy | Oil & Gas Exploration & Production | 32.000000 | -15.478079 | 2.923698 | 205 | 70 | 830000000 | -14454000000 | -35.55 | 4.065823e+08 | 93.089287 | 1.785616 |
| 125 | FB | Facebook | Information Technology | Internet Software & Services | 104.660004 | 16.224320 | 1.320606 | 8 | 958 | 592000000 | 3669000000 | 1.31 | 2.800763e+09 | 79.893133 | 5.884467 |
| 11 | AIV | Apartment Investment & Mgmt | Real Estate | REITs | 40.029999 | 7.578608 | 1.163334 | 15 | 47 | 21818000 | 248710000 | 1.52 | 1.636250e+08 | 26.335526 | -1.269332 |
| 248 | PG | Procter & Gamble | Consumer Staples | Personal Products | 79.410004 | 10.660538 | 0.806056 | 17 | 129 | 160383000 | 636056000 | 3.28 | 4.913916e+08 | 24.070121 | -2.256747 |
| 238 | OXY | Occidental Petroleum | Energy | Oil & Gas Exploration & Production | 67.610001 | 0.865287 | 1.589520 | 32 | 64 | -588000000 | -7829000000 | -10.23 | 7.652981e+08 | 93.089287 | 3.345102 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.516175 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 4.353535e+08 | 17.682214 | -3.838260 |
| 112 | EQT | EQT Corporation | Energy | Oil & Gas Exploration & Production | 52.130001 | -21.253771 | 2.364883 | 2 | 201 | 523803000 | 85171000 | 0.56 | 1.520911e+08 | 93.089287 | 9.567952 |
| 147 | HAL | Halliburton Co. | Energy | Oil & Gas Equipment & Services | 34.040001 | -5.101751 | 1.966062 | 4 | 189 | 7786000000 | -671000000 | -0.79 | 8.493671e+08 | 93.089287 | 17.345857 |
| 89 | DFS | Discover Financial Services | Financials | Consumer Finance | 53.619999 | 3.653584 | 1.159897 | 20 | 99 | 2288000000 | 2297000000 | 5.14 | 4.468872e+08 | 10.431906 | -0.375934 |
| 173 | IVZ | Invesco Ltd. | Financials | Asset Management & Custody Banks | 33.480000 | 7.067477 | 1.580839 | 12 | 67 | 412000000 | 968100000 | 2.26 | 4.283628e+08 | 14.814159 | 4.218620 |
This is a random sample of the dataset. The data itself looks fine, but the column names need cleaning up (replacing spaces with underscores).
# copying the data to another variable to avoid any changes to original data
df = stockdata.copy()
# fixing column names
df.columns = [c.replace(" ", "_") for c in df.columns]
# dropping the ticker symbol column as it is not relevant at this point.
df.drop("Ticker_Symbol", axis=1, inplace=True)
Checking the data types of the columns for the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Security                      340 non-null    object
 1   GICS_Sector                   340 non-null    object
 2   GICS_Sub_Industry             340 non-null    object
 3   Current_Price                 340 non-null    float64
 4   Price_Change                  340 non-null    float64
 5   Volatility                    340 non-null    float64
 6   ROE                           340 non-null    int64
 7   Cash_Ratio                    340 non-null    int64
 8   Net_Cash_Flow                 340 non-null    int64
 9   Net_Income                    340 non-null    int64
 10  Earnings_Per_Share            340 non-null    float64
 11  Estimated_Shares_Outstanding  340 non-null    float64
 12  P/E_Ratio                     340 non-null    float64
 13  P/B_Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(3)
memory usage: 37.3+ KB
After dropping the Ticker Symbol column, 14 columns remain. The row count is still 340, and the data types include objects, floats, and integers.
Let's take a look at the summary of the data
df.describe()
| | Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 340.000000 | 340.000000 | 340.000000 | 340.000000 | 340.000000 | 3.400000e+02 | 3.400000e+02 | 340.000000 | 3.400000e+02 | 340.000000 | 340.000000 |
| mean | 80.862345 | 4.078194 | 1.525976 | 39.597059 | 70.023529 | 5.553762e+07 | 1.494385e+09 | 2.776662 | 5.770283e+08 | 32.612563 | -1.718249 |
| std | 98.055086 | 12.006338 | 0.591798 | 96.547538 | 90.421331 | 1.946365e+09 | 3.940150e+09 | 6.587779 | 8.458496e+08 | 44.348731 | 13.966912 |
| min | 4.500000 | -47.129693 | 0.733163 | 1.000000 | 0.000000 | -1.120800e+10 | -2.352800e+10 | -61.200000 | 2.767216e+07 | 2.935451 | -76.119077 |
| 25% | 38.555000 | -0.939484 | 1.134878 | 9.750000 | 18.000000 | -1.939065e+08 | 3.523012e+08 | 1.557500 | 1.588482e+08 | 15.044653 | -4.352056 |
| 50% | 59.705000 | 4.819505 | 1.385593 | 15.000000 | 47.000000 | 2.098000e+06 | 7.073360e+08 | 2.895000 | 3.096751e+08 | 20.819876 | -1.067170 |
| 75% | 92.880001 | 10.695493 | 1.695549 | 27.000000 | 99.000000 | 1.698108e+08 | 1.899000e+09 | 4.620000 | 5.731175e+08 | 31.764755 | 3.917066 |
| max | 1274.949951 | 55.051683 | 4.580042 | 917.000000 | 958.000000 | 2.076400e+10 | 2.444200e+10 | 50.090000 | 6.159292e+09 | 528.039074 | 129.064585 |
The minimum cash ratio is zero, which could reflect genuinely depleted cash or a placeholder value.
Let's check for missing values.
# checking for missing values
df.isna().sum()
Security                        0
GICS_Sector                     0
GICS_Sub_Industry               0
Current_Price                   0
Price_Change                    0
Volatility                      0
ROE                             0
Cash_Ratio                      0
Net_Cash_Flow                   0
Net_Income                      0
Earnings_Per_Share              0
Estimated_Shares_Outstanding    0
P/E_Ratio                       0
P/B_Ratio                       0
dtype: int64
Checking for duplicate entries
# lets check duplicate observations
df.duplicated().sum()
0
There are no duplicate entries.
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    # two stacked subplots sharing the x-axis: boxplot on top, histogram below
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    # boxplot; the star marker indicates the mean value of the column
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="violet")
    # histogram, with the requested number of bins if given
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(data[feature].mean(), color="green", linestyle="--")  # mean
    ax_hist2.axvline(data[feature].median(), color="black", linestyle="-")  # median
num_cols = ['Current_Price', 'Price_Change', 'Volatility', 'ROE', 'Cash_Ratio', 'Net_Cash_Flow', 'Net_Income', 'Earnings_Per_Share', 'Estimated_Shares_Outstanding', 'P/E_Ratio', 'P/B_Ratio']
for num in num_cols:
    histogram_boxplot(df, num, bins=50, kde=True, figsize=(15, 8))
    plt.show()
The current stock price is significantly right-skewed, with a long tail indicating the presence of outliers.
Prices typically range between $0 and $150 but go higher than $1,200.
The median price is about $50 and the mean is about $80.
Price change, net cash flow, earnings per share, and the P/B ratio are roughly normally distributed.
Volatility is right-skewed with a long tail.
ROE, cash ratio, estimated shares outstanding, and the P/E ratio are significantly right-skewed with long tails.
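These skews can be quantified rather than just eyeballed; a minimal sketch using `scipy.stats.skew` on simulated samples (the lognormal/normal draws below are stand-ins for a right-skewed and a roughly symmetric variable, not the Trade&Ahead data):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # long right tail, like ROE or the P/E ratio
symmetric = rng.normal(loc=0.0, scale=1.0, size=5000)         # roughly symmetric, like price change

print(round(skew(right_skewed), 2))  # clearly positive
print(round(skew(symmetric), 2))     # near zero
```

A positive skewness confirms the long right tail seen in the histograms; the same check could be run on each numeric column of the dataframe.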
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc:
            # percentage of each class of the category
            label = "{:.1f}%".format(100 * p.get_height() / total)
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage above the bar

    plt.show()  # show the plot
labeled_barplot(df, 'GICS_Sector', perc=True)
Industrials make up the largest share among GICS sectors, followed by Financials.
labeled_barplot(df, 'GICS_Sub_Industry', perc=True)
The dominant sub-industries include Oil & Gas Exploration & Production, REITs, Industrial Conglomerates, Electric Utilities, Internet Software & Services, Health Care Equipment, MultiUtilities, and Banks.
# grouping by sector and calculating the average price change
avg_max_price_inc = df.groupby('GICS_Sector')['Price_Change'].mean().reset_index()
avg_max_price_inc
| | GICS_Sector | Price_Change |
|---|---|---|
| 0 | Consumer Discretionary | 5.846093 |
| 1 | Consumer Staples | 8.684750 |
| 2 | Energy | -10.228289 |
| 3 | Financials | 3.865406 |
| 4 | Health Care | 9.585652 |
| 5 | Industrials | 2.833127 |
| 6 | Information Technology | 7.217476 |
| 7 | Materials | 5.589738 |
| 8 | Real Estate | 6.205548 |
| 9 | Telecommunications Services | 6.956980 |
| 10 | Utilities | 0.803657 |
Health Care had the highest average price increase, while Energy had the largest average decline.
pd.crosstab(df.GICS_Sub_Industry, df.GICS_Sector).style.highlight_max(
color="lightgreen", axis=0
)
| GICS_Sector | Consumer Discretionary | Consumer Staples | Energy | Financials | Health Care | Industrials | Information Technology | Materials | Real Estate | Telecommunications Services | Utilities |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GICS_Sub_Industry | |||||||||||
| Advertising | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Aerospace & Defense | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Agricultural Products | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Air Freight & Logistics | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| Airlines | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |
| Alternative Carriers | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Apparel, Accessories & Luxury Goods | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Application Software | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| Asset Management & Custody Banks | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Auto Parts & Equipment | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Automobile Manufacturers | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Banks | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Biotechnology | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
| Brewers | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Broadcasting & Cable TV | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Building Products | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Cable & Satellite | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Casinos & Gaming | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Computer Hardware | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Construction & Farm Machinery & Heavy Trucks | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| Construction Materials | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| Consumer Electronics | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Consumer Finance | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Copper | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Data Processing & Outsourced Services | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| Distributors | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Diversified Chemicals | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 |
| Diversified Commercial Services | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Diversified Financial Services | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Drug Retail | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Electric Utilities | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |
| Electrical Components & Equipment | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Electronic Components | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| Electronic Equipment & Instruments | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Environmental Services | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Fertilizers & Agricultural Chemicals | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| Financial Exchanges & Data | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Gold | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Health Care Distributors | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| Health Care Equipment | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 0 |
| Health Care Facilities | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| Health Care Supplies | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| Home Entertainment Software | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Home Furnishings | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Homebuilding | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Hotels, Resorts & Cruise Lines | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Household Appliances | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Household Products | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Housewares & Specialties | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Human Resource & Employment Services | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| IT Consulting & Other Services | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| Industrial Conglomerates | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 0 |
| Industrial Gases | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Industrial Machinery | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |
| Industrial Materials | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Insurance Brokers | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Integrated Oil & Gas | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Integrated Telecommunications Services | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |
| Internet & Direct Marketing Retail | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Internet Software & Services | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 0 | 0 |
| Investment Banking & Brokerage | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Leisure Products | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Life & Health Insurance | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Life Sciences Tools & Services | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Managed Health Care | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| Metal & Glass Containers | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Motorcycle Manufacturers | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Multi-Sector Holdings | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Multi-line Insurance | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| MultiUtilities | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 |
| Networking Equipment | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Office REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Oil & Gas Equipment & Services | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Oil & Gas Exploration & Production | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Oil & Gas Refining & Marketing & Transportation | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Packaged Foods & Meats | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Paper Packaging | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| Personal Products | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pharmaceuticals | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 |
| Property & Casualty Insurance | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Publishing | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 |
| Railroads | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Real Estate Services | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Regional Banks | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Research & Consulting Services | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Residential REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 |
| Restaurants | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Retail REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 |
| Semiconductor Equipment | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Semiconductors | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 |
| Soft Drinks | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Specialized REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| Specialty Chemicals | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 |
| Specialty Retail | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Specialty Stores | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Steel | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Technology Hardware, Storage & Peripherals | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Technology, Hardware, Software and Supplies | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Thrifts & Mortgage Finance | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tires & Rubber | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tobacco | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Trucking | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Water Utilities | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Most securities fall under the Oil & Gas Exploration & Production sub-industry, which belongs to the Energy sector.
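For reference, `pd.crosstab` simply tallies how often each pair of categories co-occurs; a tiny sketch with made-up rows (not from the dataset):

```python
import pandas as pd

# toy rows, for illustration only
toy = pd.DataFrame({
    "GICS_Sub_Industry": ["Banks", "Banks", "Airlines", "Electric Utilities"],
    "GICS_Sector": ["Financials", "Financials", "Industrials", "Utilities"],
})
ct = pd.crosstab(toy["GICS_Sub_Industry"], toy["GICS_Sector"])
print(ct.loc["Banks", "Financials"])  # 2
```

Chaining `.style.highlight_max(axis=0)`, as above, then shades the largest count in each column.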
# Bivariate analysis
# Stock Price_Change Vs. GICS_Sector
plt.figure(figsize=(20,6))
sns.boxplot(data = df, y = "Price_Change", x = "GICS_Sector");
# Cash_Ratio Vs. GICS_Sector
plt.figure(figsize=(20,8))
sns.boxplot(data = df, y = "Cash_Ratio", x = "GICS_Sector");
Utilities securities had the lowest cash ratios.
# P/E_Ratio Vs. GICS_Sector
plt.figure(figsize=(20,8))
sns.boxplot(data = df, y = "P/E_Ratio", x = "GICS_Sector");
plt.figure(figsize=(15, 7))
sns.heatmap(df[num_cols].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Net income is positively correlated with earnings per share and estimated shares outstanding.
fig, axes = plt.subplots(6, 2, figsize=(20, 18))
fig.suptitle("CDF plot of numerical variables", fontsize=20)
counter = 0
for ii in range(6):
    for jj in range(2):
        if counter < len(num_cols):  # a 6x2 grid fits all 11 numeric variables
            sns.ecdfplot(ax=axes[ii][jj], x=df[num_cols[counter]])
            counter += 1
fig.tight_layout(pad=2.0)
# Pair-plot analysis
sns.pairplot(df[num_cols],diag_kind="kde");
Consistent with the univariate analysis, price change, net cash flow, earnings per share, and the P/B ratio look roughly normal, while the remaining variables are right-skewed.
# Scaling the data set before clustering
scaler = StandardScaler()
subset = df[num_cols].copy()
subset_scaled = scaler.fit_transform(subset)
# Creating a dataframe from the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
subset_scaled_df
| | Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 |
| 1 | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 |
| 2 | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 |
| 3 | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 |
| 4 | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335 | -0.486181 | 0.901646 | 0.540121 | -0.255142 | 4.308162 | -0.559673 | -1.487784 | -1.127481 | 0.429111 | -0.082116 | 0.572194 |
| 336 | -0.289510 | -1.065766 | -0.079703 | 1.062211 | -0.476513 | 0.053235 | -0.051186 | 0.029391 | -0.167741 | -0.337154 | -0.152012 |
| 337 | 0.221913 | 0.439539 | -0.206067 | -0.400362 | 0.332009 | 0.164889 | -0.342467 | -0.303532 | -0.460058 | 2.233634 | -1.589390 |
| 338 | -0.547053 | -0.436811 | -0.097813 | -0.369243 | 0.320933 | -0.051022 | -0.301171 | -0.239684 | -0.377852 | -0.222714 | 0.118680 |
| 339 | -0.336453 | 1.051046 | 0.142671 | -0.078803 | -0.055639 | 0.111378 | -0.293666 | -0.318734 | -0.092942 | 0.854902 | 0.246754 |
340 rows × 11 columns
# Pair-plot analysis
sns.pairplot(subset_scaled_df ,diag_kind="kde");
All attributes are now standardized to a common scale, with a mean of 0 and a standard deviation of 1. The dataset has no missing values or duplicate entries. Outliers were detected but have not been treated.
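The claim that z-scored columns have mean 0 and standard deviation 1 can be verified directly; a small sketch on simulated data (the column names are illustrative, not from the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# two simulated columns on wildly different scales, as in the stock data
raw = pd.DataFrame({
    "price": rng.lognormal(4.0, 1.0, size=200),
    "volatility": rng.normal(1.5, 0.5, size=200),
})
scaled = pd.DataFrame(StandardScaler().fit_transform(raw), columns=raw.columns)

# every column now has mean ~0 and (population) standard deviation ~1
print(scaled.mean().abs().max(), (scaled.std(ddof=0) - 1).abs().max())
```

Note that `StandardScaler` uses the population standard deviation (`ddof=0`), so the check uses the same convention.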
clusters = range(1, 9)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(subset_scaled_df)
    prediction = model.predict(subset_scaled_df)
    # average distance of each observation to its assigned cluster center
    distortion = (
        sum(np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1))
        / subset_scaled_df.shape[0]
    )
    meanDistortions.append(distortion)
    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average distortion")
plt.title("Selecting k with the Elbow Method")
plt.show()
Number of Clusters: 1 	Average Distortion: 2.5425069919221697
Number of Clusters: 2 	Average Distortion: 2.3862098789299604
Number of Clusters: 3 	Average Distortion: 2.2652173220958
Number of Clusters: 4 	Average Distortion: 2.176396791566185
Number of Clusters: 5 	Average Distortion: 2.112997232924101
Number of Clusters: 6 	Average Distortion: 2.0574742484876594
Number of Clusters: 7 	Average Distortion: 2.0316763803182827
Number of Clusters: 8 	Average Distortion: 1.974528327675814
The appropriate value of k appears to be 3, 4, or 5.
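The elbow behaviour seen above can be reproduced end-to-end on synthetic data: the average distortion keeps falling as k grows, but the drop flattens once k matches the true structure. A self-contained sketch (the blob parameters and seeds are arbitrary, not the stock data):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# three well-separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

distortions = {}
for k in (1, 2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # average distance of each point to its nearest cluster center
    distortions[k] = cdist(X, km.cluster_centers_).min(axis=1).mean()

print(distortions)  # drops sharply until k=3, then flattens (the "elbow")
```

With well-separated groups the kink is unambiguous; on real data, as here, the curve is smoother and several values of k remain plausible.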
sil_score = []
cluster_list = list(range(2, 10))

for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(subset_scaled_df)
    score = silhouette_score(subset_scaled_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

plt.plot(cluster_list, sil_score)
For n_clusters = 2, silhouette score is 0.43969639509980457
For n_clusters = 3, silhouette score is 0.45797710447228496
For n_clusters = 4, silhouette score is 0.4495910932182279
For n_clusters = 5, silhouette score is 0.4201772228010058
For n_clusters = 6, silhouette score is 0.39935785683828356
For n_clusters = 7, silhouette score is 0.400894762440915
For n_clusters = 8, silhouette score is 0.4213918040275059
For n_clusters = 9, silhouette score is 0.31844099775524826
The silhouette score peaks at 3 clusters, with 4 a close second.
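As a sanity check of the silhouette criterion, on simulated data with a known number of clusters the score does peak at the true k; a hedged sketch (the centers, seeds, and sizes below are made up):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# four clearly separated synthetic clusters
centers = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.7, random_state=7)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 4 for these well-separated clusters
```

On the stock data the peak is much less pronounced (all scores sit near 0.4), which is why the elbow curve and the silhouette plots below are consulted together.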
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(7, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 7 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
Let's take 4 as the appropriate number of clusters, as its silhouette score is high enough and there is a kink around 4-5 in the elbow curve.
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(subset_scaled_df)
KMeans(n_clusters=4, random_state=0)
# adding kmeans cluster labels to the original dataframe
df["K_means_segments"] = kmeans.labels_
subset_scaled_df['K_means_segments'] = kmeans.labels_
Cluster Profiling
cluster_profile = df.groupby('K_means_segments').mean(numeric_only=True)  # average only the numeric columns
cluster_profile['count_in_each_segments'] = df.groupby('K_means_segments')['Security'].count().values
cluster_profile
| | Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | count_in_each_segments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments | ||||||||||||
| 0 | 234.170932 | 13.400685 | 1.729989 | 25.600000 | 277.640000 | 1.554927e+09 | 1.572612e+09 | 6.045200 | 5.783163e+08 | 74.960824 | 14.402452 | 25 |
| 1 | 38.099260 | -15.370329 | 2.910500 | 107.074074 | 50.037037 | -1.594285e+08 | -3.887458e+09 | -9.473704 | 4.803986e+08 | 90.619220 | 1.342067 | 27 |
| 2 | 50.517273 | 5.747586 | 1.130399 | 31.090909 | 75.909091 | -1.072273e+09 | 1.483309e+10 | 4.154545 | 4.298827e+09 | 14.803577 | -4.552119 | 11 |
| 3 | 72.399112 | 5.066225 | 1.388319 | 34.620939 | 53.000000 | -1.404622e+07 | 1.482212e+09 | 3.621029 | 4.385338e+08 | 23.843656 | -3.358948 | 277 |
fig, axes = plt.subplots(3, 4, figsize=(20, 16))
fig.suptitle("Boxplot of numerical variables for each cluster", fontsize=20)
counter = 0
for ii in range(3):
    for jj in range(4):
        if counter < 11:
            sns.boxplot(
                ax=axes[ii, jj],
                y=subset_scaled_df[num_cols[counter]],
                x=subset_scaled_df["K_means_segments"],
            )
            counter += 1
fig.tight_layout(pad=2.0)
# Comparing cluster vs. GICS_Sector
pd.crosstab(df.GICS_Sector, df.K_means_segments).style.highlight_max(color = 'lightgreen', axis = 0)
| GICS_Sector \ K_means_segments | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Consumer Discretionary | 6 | 0 | 1 | 33 |
| Consumer Staples | 1 | 0 | 1 | 17 |
| Energy | 1 | 22 | 1 | 6 |
| Financials | 1 | 0 | 3 | 45 |
| Health Care | 9 | 0 | 2 | 29 |
| Industrials | 0 | 1 | 0 | 52 |
| Information Technology | 5 | 3 | 1 | 24 |
| Materials | 0 | 1 | 0 | 19 |
| Real Estate | 1 | 0 | 0 | 26 |
| Telecommunications Services | 1 | 0 | 2 | 2 |
| Utilities | 0 | 0 | 0 | 24 |
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(subset_scaled_df, metric=dm, method=lm)
c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
print(
"Cophenetic correlation for {} distance and {} linkage is {}".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = dm
high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9361573137155428
Cophenetic correlation for Euclidean distance and complete linkage is 0.8912171674298116
Cophenetic correlation for Euclidean distance and average linkage is 0.9402170033461534
Cophenetic correlation for Euclidean distance and weighted linkage is 0.891382329460296
Cophenetic correlation for Chebyshev distance and single linkage is 0.9187458528921174
Cophenetic correlation for Chebyshev distance and complete linkage is 0.8228080389704356
Cophenetic correlation for Chebyshev distance and average linkage is 0.9374227213554178
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9145555208649849
Cophenetic correlation for Mahalanobis distance and single linkage is 0.9428343342402944
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7430655304868364
Cophenetic correlation for Mahalanobis distance and average linkage is 0.9442308978857612
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.9064227595906567
Cophenetic correlation for Cityblock distance and single linkage is 0.9510044191059339
Cophenetic correlation for Cityblock distance and complete linkage is 0.8283141710925163
Cophenetic correlation for Cityblock distance and average linkage is 0.9203303446749941
Cophenetic correlation for Cityblock distance and weighted linkage is 0.8205165624117894
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.9510044191059339, which is obtained with Cityblock distance and single linkage
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
Z = linkage(subset_scaled_df, metric="euclidean", method=lm)
c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
print("Cophenetic correlation for {} linkage is {}".format(lm, c))
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = "euclidean"
high_dm_lm[1] = lm
Cophenetic correlation for single linkage is 0.9361573137155428
Cophenetic correlation for complete linkage is 0.8912171674298116
Cophenetic correlation for average linkage is 0.9402170033461534
Cophenetic correlation for centroid linkage is 0.9350232346888858
Cophenetic correlation for ward linkage is 0.7636969819420432
Cophenetic correlation for weighted linkage is 0.891382329460296
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} linkage".format(
high_cophenet_corr, high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.9402170033461534, which is obtained with average linkage
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(20, 40))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(subset_scaled_df, metric="euclidean", method=method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(subset_scaled_df))
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
The cophenetic correlation is highest with the average and centroid linkages.
From the dendrogram for average linkage, 6 appears to be the appropriate number of clusters.
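An alternative to refitting `AgglomerativeClustering` is to cut the already-computed linkage matrix directly with SciPy's `fcluster`. A minimal sketch, using random data as a hypothetical stand-in for `subset_scaled_df`:

```python
# Sketch: extracting 6 flat clusters from a hierarchical linkage matrix,
# instead of refitting a clustering model from scratch.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))  # hypothetical stand-in for subset_scaled_df

Z = linkage(X, metric="euclidean", method="average")
labels = fcluster(Z, t=6, criterion="maxclust")  # cut the tree into 6 flat clusters
print(np.unique(labels))
```

This reuses the same `Z` computed for the dendrogram, so the flat labels are guaranteed to be consistent with the tree that was visually inspected.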
HCmodel = AgglomerativeClustering(n_clusters=6, affinity="euclidean", linkage="average")  # note: scikit-learn >= 1.2 deprecates affinity in favor of metric
HCmodel.fit(subset_scaled_df)
AgglomerativeClustering(affinity='euclidean', linkage='average', n_clusters=6)
subset_scaled_df["HC_Clusters"] = HCmodel.labels_
df["HC_Clusters"] = HCmodel.labels_
cluster_profile = df.groupby("HC_Clusters").mean(numeric_only=True)  # numeric_only avoids errors on text columns in pandas >= 2.0
cluster_profile["count_in_each_segments"] = (
    df.groupby("HC_Clusters")["Security"].count().values
)
cluster_profile
| HC_Clusters | Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | K_means_segments | count_in_each_segments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 77.287589 | 4.099730 | 1.518066 | 35.336336 | 66.900901 | -3.319732e+07 | 1.538075e+09 | 2.88527 | 5.605050e+08 | 32.441706 | -2.174921 | 2.630631 | 333 |
| 1 | 25.640000 | 11.237908 | 1.322355 | 12.500000 | 130.500000 | 1.675550e+10 | 1.365400e+10 | 3.29500 | 2.791829e+09 | 13.649696 | 1.508484 | 1.000000 | 2 |
| 2 | 24.485001 | -13.351992 | 3.482611 | 802.000000 | 51.000000 | -1.292500e+09 | -1.910650e+10 | -41.81500 | 5.195740e+08 | 60.748608 | 1.565141 | 1.000000 | 2 |
| 3 | 104.660004 | 16.224320 | 1.320606 | 8.000000 | 958.000000 | 5.920000e+08 | 3.669000e+09 | 1.31000 | 2.800763e+09 | 79.893133 | 5.884467 | 0.000000 | 1 |
| 4 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1.671386e+09 | 2.551360e+09 | 50.09000 | 5.093552e+07 | 25.453183 | -1.052429 | 0.000000 | 1 |
| 5 | 276.570007 | 6.189286 | 1.116976 | 30.000000 | 25.000000 | 9.088500e+07 | 5.965410e+08 | 8.91000 | 6.695185e+07 | 31.040405 | 129.064585 | 0.000000 | 1 |
HCmodel = AgglomerativeClustering(n_clusters=4, affinity="euclidean", linkage="ward")  # note: scikit-learn >= 1.2 deprecates affinity in favor of metric
HCmodel.fit(subset_scaled_df)
AgglomerativeClustering(affinity='euclidean', n_clusters=4)
subset_scaled_df["HC_Clusters"] = HCmodel.labels_
df["HC_Clusters"] = HCmodel.labels_
cluster_profile = df.groupby("HC_Clusters").mean(numeric_only=True)  # numeric_only avoids errors on text columns in pandas >= 2.0
cluster_profile["count_in_each_segments"] = (
    df.groupby("HC_Clusters")["Security"].count().values
)
cluster_profile
| HC_Clusters | Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | K_means_segments | count_in_each_segments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 46.558126 | -11.798670 | 2.617878 | 178.750000 | 50.250000 | 4.349716e+07 | -3.197472e+09 | -7.785312 | 4.732895e+08 | 72.496532 | -0.780467 | 1.281250 | 32 |
| 1 | 71.846974 | 4.953643 | 1.392784 | 25.117216 | 53.831502 | 1.197788e+06 | 1.557674e+09 | 3.691044 | 4.439183e+08 | 23.583804 | -3.087957 | 2.996337 | 273 |
| 2 | 229.579357 | 14.049986 | 1.735216 | 25.423077 | 268.423077 | 1.712688e+09 | 1.981882e+09 | 5.946923 | 7.219242e+08 | 84.216911 | 13.114240 | 0.115385 | 26 |
| 3 | 46.672222 | 5.166566 | 1.079367 | 25.000000 | 58.333333 | -3.040667e+09 | 1.484844e+10 | 3.435556 | 4.564960e+09 | 15.596051 | -6.354193 | 2.000000 | 9 |
# let's see the names of the securities in each cluster
for cl in df["HC_Clusters"].unique():
print(
"The",
df[df["HC_Clusters"] == cl]["Security"].nunique(),
"Securities in cluster",
cl,
"are:",
)
print(df[df["HC_Clusters"] == cl]["Security"].unique())
print("-" * 100, "\n")
The 273 Securities in cluster 1 are: ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'Archer-Daniels-Midland Co' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'CIGNA Corp.' 'Cincinnati Financial' 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Exelon Corp.' 
"Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'General Dynamics' 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'Kimco Realty' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Altria Group Inc' 'The Mosaic Company' 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group' "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 
'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International' 'Roper Industries' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Spectra Energy Corp.' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc' 'Wec Energy Group Inc' 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] ---------------------------------------------------------------------------------------------------- The 26 Securities in cluster 2 are: ['Analog Devices, Inc.' 'Alliance Data Systems' 'Alexion Pharmaceuticals' 'Amgen Inc' 'Amazon.com Inc' 'Bank of America Corp' 'BIOGEN IDEC Inc.' 'Celgene Corp.' 
'Chipotle Mexican Grill' 'Equinix' 'Edwards Lifesciences' 'Facebook' 'First Solar Inc' 'Frontier Communications' 'Intel Corp.' 'Intuitive Surgical Inc.' "McDonald's Corp." 'Monster Beverage' 'Netflix Inc.' 'Priceline.com Inc' 'Regeneron' 'TripAdvisor' 'Vertex Pharmaceuticals Inc' 'Waters Corporation' 'Wynn Resorts Ltd' 'Yahoo Inc.'] ---------------------------------------------------------------------------------------------------- The 32 Securities in cluster 0 are: ['Allegion' 'Apache Corporation' 'Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Chesapeake Energy' 'Charter Communications' 'Colgate-Palmolive' 'Cabot Oil & Gas' 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources' 'EQT Corporation' 'Freeport-McMoran Cp & Gld' 'Halliburton Co.' 'Hess Corporation' 'Hewlett Packard Enterprise' 'Kimberly-Clark' 'Kinder Morgan' 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK' 'Occidental Petroleum' 'Quanta Services Inc.' 'Range Resources Corp.' 'S&P Global, Inc.' 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy'] ---------------------------------------------------------------------------------------------------- The 9 Securities in cluster 3 are: ['Citigroup Inc.' 'Ford Motor' 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.'] ----------------------------------------------------------------------------------------------------
fig, axes = plt.subplots(3, 4, figsize=(20, 16))
fig.suptitle("Boxplot of numerical variables for each cluster", fontsize=20)
counter = 0
for ii in range(3):
    for jj in range(4):
        if counter < 11:
            sns.boxplot(
                ax=axes[ii, jj],
                y=subset_scaled_df[num_cols[counter]],
                x=subset_scaled_df["HC_Clusters"],
            )
            counter = counter + 1
fig.tight_layout(pad=2.0)
# Comparing cluster vs. GICS_Sector
pd.crosstab(df.GICS_Sector, df.HC_Clusters).style.highlight_max(color = 'lightgreen', axis = 0)
| GICS_Sector \ HC_Clusters | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Consumer Discretionary | 1 | 32 | 6 | 1 |
| Consumer Staples | 2 | 15 | 1 | 1 |
| Energy | 23 | 6 | 0 | 1 |
| Financials | 1 | 44 | 1 | 3 |
| Health Care | 0 | 30 | 9 | 1 |
| Industrials | 2 | 51 | 0 | 0 |
| Information Technology | 2 | 24 | 7 | 0 |
| Materials | 1 | 19 | 0 | 0 |
| Real Estate | 0 | 26 | 1 | 0 |
| Telecommunications Services | 0 | 2 | 1 | 2 |
| Utilities | 0 | 24 | 0 | 0 |
INSIGHT
Both clustering techniques produce broadly similar cluster profiles: each isolates a small high-priced, high-cash-ratio group, an Energy-dominated underperforming group, and one large diversified group, with the remaining securities differing only at the margins. The crosstab of K-means segments against hierarchical clusters below quantifies this overlap.
# Comparing Hierarchical cluster vs. K-means cluster
pd.crosstab(df.K_means_segments, df.HC_Clusters).style.highlight_max(color = 'lightgreen', axis = 0)
| K_means_segments \ HC_Clusters | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1 | 0 | 24 | 0 |
| 1 | 26 | 0 | 1 | 0 |
| 2 | 0 | 1 | 1 | 9 |
| 3 | 5 | 272 | 0 | 0 |
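Beyond eyeballing the crosstab, agreement between two labelings can be summarized with a single number such as the adjusted Rand index, which is invariant to how the cluster labels are numbered. A minimal sketch with hypothetical toy labelings (in the notebook these would be `df["K_means_segments"]` and `df["HC_Clusters"]`):

```python
# Sketch: quantifying agreement between two cluster assignments with the
# adjusted Rand index (ARI). ARI = 1 means identical partitions, ~0 means
# agreement no better than chance.
from sklearn.metrics import adjusted_rand_score

km = [0, 0, 1, 1, 2, 2, 2, 3]  # hypothetical K-means labels
hc = [2, 2, 0, 0, 3, 3, 3, 1]  # same partition, different label names
print(adjusted_rand_score(km, hc))  # -> 1.0
```

Because ARI ignores label naming, it directly captures the "331 of 340 securities grouped alike" observation without having to match cluster numbers across the two methods manually.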
# PCA to reduce the data to two dimensions and visualize it to see how well-separated the clusters are
# setting the number of components to 2
pca = PCA(n_components=2)
# transforming data and storing results in a dataframe
X_reduced_pca = pca.fit_transform(subset_scaled_df)
reduced_df_pca = pd.DataFrame(
data=X_reduced_pca, columns=["Component 1", "Component 2"]
)
# checking the amount of variance explained
pca.explained_variance_ratio_.sum()
0.38933325456234846
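Two components explain only about 39% of the variance here, so the 2-D scatterplots are a rough visual aid rather than a faithful map of the clusters. One common diagnostic is the cumulative explained-variance curve; a minimal sketch, using random data as a hypothetical stand-in for the 11 scaled features (the 90% threshold is an illustrative choice, not from the original analysis):

```python
# Sketch: inspecting cumulative explained variance to decide how many
# principal components would be needed to retain most of the information.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(340, 11))  # hypothetical stand-in for subset_scaled_df

pca = PCA().fit(X)  # keep all components
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.argmax(cum_var >= 0.90)) + 1  # first k reaching 90% variance
print(n_keep, round(float(cum_var[-1]), 6))
```

On real, correlated financial indicators the curve typically rises faster than on independent noise, so fewer components would be needed than this stand-in suggests.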
#Scatterplot by cluster labels - K-Means Clustering
sns.scatterplot(
data=reduced_df_pca,
x="Component 1",
y="Component 2",
hue=df["K_means_segments"],
palette="rainbow",
)
plt.legend(bbox_to_anchor=(1, 1))
# Scatterplot by cluster labels - HierarchicalClustering
sns.scatterplot(
data=reduced_df_pca,
x="Component 1",
y="Component 2",
hue=df["HC_Clusters"],
palette="rainbow",
)
plt.legend(bbox_to_anchor=(1, 1))
Exploratory Data Analysis Findings:
- Stock prices and estimated shares outstanding exhibit right-skewed distributions with positive outliers.
- The Health Care and Financials sectors showed notable positive price changes, making them attractive to investors.
- The Information Technology and Financials sectors stand out with high cash ratios, enhancing their appeal.
- The Real Estate sector offers stable investments due to its consistent price change and cash ratio.
- The Energy sector is volatile, with high price-change variance, yet holds securities with significant P/E ratios.
Clustering Summary:
- Of the 340 securities, both methods clustered 331 similarly; 9 were assigned differently.
- Clusters aligned with industry, and the PCA projections show consistent results across the two methods.
Cluster Insights:
- A high-performing cluster (25+ securities) led by Health Care, Consumer Discretionary, and Information Technology.
- A historically underperforming cluster (25+ securities), mostly from Energy.
- A moderately aggressive cluster (~10 securities) led by Financials.
- A large, mildly aggressive cluster (270+ securities) diversely spread across Industrials, Financials, Consumer Discretionary, Real Estate, and Information Technology.
Recommendations:
- The four clusters offer diverse investment approaches based on aggressiveness and performance.
- Market volatility underscores the need for ongoing analysis and periodic re-clustering to keep the groupings, and the predictions built on them, current.